SC RNA-Seq: Checking for duplicates with dupRadar
SC RNA-Seq: Checking for duplicates with dupRadar
Libraries required
Loading required package: viridisLite
Bam files (marked duplicates)
Analyze duplicates
dm <- NULL
if (!file.exists("./output/analyzedDuplicates.rds")) {
dm <- lapply(bamFiles, function(x) {
a <- analyzeDuprates(bam = x, gtf = "./input/gencode.vM18.gff3", stranded = 0,
paired = F, threads = detectCores() - 1)
# removing versions from gene IDs
a$ID <- as.character(sapply(as.character(a$ID), function(y) strsplit(y,
"\\.")[[1]][1]))
return(a)
})
saveRDS(object = dm, file = "./output/analyzedDuplicates.rds")
} else {
dm <- readRDS("./output/analyzedDuplicates.rds")
}Plotting and interpretation
The number of reads per base assigned to a gene in an ideal RNA-Seq data set is expected to be proportional to the abundance of its transcripts in the sample. For lowly expressed genes we expect read duplication to happen rarely by chance, while for highly expressed genes - depending on the total sequencing depth - we expect read duplication to happen often.
A good way to learn if a dataset is following this trend is by relating the normalized number of counts per gene (RPK, as a quantification of the gene expression) and the fraction represented by duplicated reads.
A duprate plot (blue cloud)
for (i in seq_along(dm)) {
name <- names(dm)[i]
cat("\n \n")
cat(paste("###", name))
cat("\n \n")
duprateExpDensPlot(DupMat = dm[[i]], pal = viridis(n = 1000), main = name)
cat("\n \n")
}Adult_01
Adult_03
Adult_11
Adult_13
Adult_15
Adult_17
PND15_01
PND15_02
PND15_03
PND15_04
PND15_05
PND15_06
PND15_07
PND15_08
PND15_09
PND8_01
PND8_02
PND8_03
PND8_04
PND8_05
PND8_06
PND8_07
PND8_08
Duprate Boxplot
The duprateExpBoxplot plot shows the range of the duplication rates at 5% bins (default) along the distribution of RPK gene counts. The x-axis displays the quantile of the RPK distribution, and the average RPK of the genes contained in this quantile.
for (i in seq_along(dm)) {
name <- names(dm)[i]
cat("\n \n")
cat(paste("###", name))
cat("\n \n")
duprateExpBoxplot(DupMat = dm[[i]], main = name)
cat("\n \n")
}Adult_01
Adult_03
Adult_11
Adult_13
Adult_15
Adult_17
PND15_01
PND15_02
PND15_03
PND15_04
PND15_05
PND15_06
PND15_07
PND15_08
PND15_09
PND8_01
PND8_02
PND8_03
PND8_04
PND8_05
PND8_06
PND8_07
PND8_08
Read counts expression
for (i in seq_along(dm)) {
name <- names(dm)[i]
cat("\n \n")
cat(paste("###", name))
cat("\n \n")
readcountExpBoxplot(DupMat = dm[[i]])
cat("\n \n")
}Adult_01
Adult_03
Adult_11
Adult_13
Adult_15
Adult_17
PND15_01
PND15_02
PND15_03
PND15_04
PND15_05
PND15_06
PND15_07
PND15_08
PND15_09
PND8_01
PND8_02
PND8_03
PND8_04
PND8_05
PND8_06
PND8_07
PND8_08
Read counts expression histogram
for (i in seq_along(dm)) {
name <- names(dm)[i]
cat("\n \n")
cat(paste("###", name))
cat("\n \n")
expressionHist(DupMat = dm[[i]])
cat("\n \n")
}Adult_01
Adult_03
Adult_11
Adult_13
Adult_15
Adult_17
PND15_01
PND15_02
PND15_03
PND15_04
PND15_05
PND15_06
PND15_07
PND15_08
PND15_09
PND8_01
PND8_02
PND8_03
PND8_04
PND8_05
PND8_06
PND8_07
PND8_08
Comparison of multi-mapping RPK and uniquely-mapping RPK
for (i in seq_along(dm)) {
name <- names(dm)[i]
cat("\n \n")
cat(paste("###", name))
cat("\n \n")
plot(log2(dm[[i]]$RPK), log2(dm[[i]]$RPKMulti), xlab = "Reads per kb (uniquely mapping reads only)",
ylab = "Reads per kb (all including multimappers, non-weighted)")
cat("\n \n")
}Adult_01
Adult_03
Adult_11
Adult_13
Adult_15
Adult_17
PND15_01
PND15_02
PND15_03
PND15_04
PND15_05
PND15_06
PND15_07
PND15_08
PND15_09
PND8_01
PND8_02
PND8_03
PND8_04
PND8_05
PND8_06
PND8_07
PND8_08
Connection between possible PCR artefacts and GC content
## set up biomart connection for mouse (needs internet connection)
ensm <- useMart("ensembl")
ensm <- useDataset("mmusculus_gene_ensembl", mart = ensm)
## get a table which has the gene GC content for the IDs that have been used
## to generate the table
tr <- getBM(attributes = c("ensembl_gene_id", "percentage_gene_gc_content"),
values = TRUE, mart = ensm)
## create a GC vector with IDs as element names
mgi.gc <- tr$percentage_gene_gc_content
names(mgi.gc) <- tr$ensembl_gene_idCompare the dependence of duplication rate on expression level independently for below and above median GC genes
for (i in seq_along(dm)) {
name <- names(dm)[i]
cat("\n \n")
cat(paste("###", name))
cat("\n \n")
keep <- dm[[i]]$ID %in% tr$ensembl_gene_id
dm.gc <- dm[[i]][keep, ]
dm.gc$gc <- mgi.gc[dm.gc$ID]
par(mfrow = c(1, 2))
## below median GC genes
duprateExpDensPlot(dm.gc[dm.gc$gc <= 45, ], main = "below median GC genes")
## above median GC genes
duprateExpDensPlot(dm.gc[dm.gc$gc >= 45, ], main = "above median GC genes")
cat("\n \n")
}Adult_01
Adult_03
Adult_11
Adult_13
Adult_15
Adult_17
PND15_01
PND15_02
PND15_03
PND15_04
PND15_05
PND15_06
PND15_07
PND15_08
PND15_09
PND8_01
PND8_02
PND8_03
PND8_04
PND8_05
PND8_06
PND8_07
PND8_08
SessionInfo
─ Session info ──────────────────────────────────────────────────────────
setting value
version R version 3.6.1 (2019-07-05)
os Ubuntu 16.04.6 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Zurich
date 2019-08-19
─ Packages ──────────────────────────────────────────────────────────────
package * version date lib source
AnnotationDbi 1.46.0 2019-05-02 [1] Bioconductor
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.1)
backports 1.1.4 2019-04-10 [1] CRAN (R 3.6.1)
Biobase 2.44.0 2019-05-02 [1] Bioconductor
BiocGenerics 0.30.0 2019-05-02 [1] Bioconductor
biomaRt * 2.40.3 2019-07-17 [1] Bioconductor
bit 1.1-14 2018-05-29 [1] CRAN (R 3.6.1)
bit64 0.9-7 2017-05-08 [1] CRAN (R 3.6.1)
bitops 1.0-6 2013-08-17 [1] CRAN (R 3.6.1)
blob 1.2.0 2019-07-09 [1] CRAN (R 3.6.1)
bookdown 0.12 2019-07-11 [1] CRAN (R 3.6.1)
callr 3.3.1 2019-07-18 [1] CRAN (R 3.6.1)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.1)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.1)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.1)
curl 4.0 2019-07-22 [1] CRAN (R 3.6.1)
DBI 1.0.0 2018-05-02 [1] CRAN (R 3.6.1)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.1)
devtools 2.1.0 2019-07-06 [1] CRAN (R 3.6.1)
digest 0.6.20 2019-07-04 [1] CRAN (R 3.6.1)
dplyr 0.8.3 2019-07-04 [1] CRAN (R 3.6.1)
dupRadar * 1.14.0 2019-05-02 [1] Bioconductor
evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.1)
formatR 1.7 2019-06-11 [1] CRAN (R 3.6.1)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.1)
ggplot2 3.2.1 2019-08-10 [1] CRAN (R 3.6.1)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.1)
gridExtra 2.3 2017-09-09 [1] CRAN (R 3.6.1)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.1)
highr 0.8 2019-03-20 [1] CRAN (R 3.6.1)
hms 0.5.0 2019-07-09 [1] CRAN (R 3.6.1)
htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.6.1)
httpuv 1.5.1 2019-04-05 [1] CRAN (R 3.6.1)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.1)
IRanges 2.18.1 2019-05-31 [1] Bioconductor
KernSmooth 2.23-15 2015-06-29 [1] CRAN (R 3.6.1)
knitr * 1.24 2019-08-08 [1] CRAN (R 3.6.1)
later 0.8.0 2019-02-11 [1] CRAN (R 3.6.1)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.6.1)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.1)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.1)
mime 0.7 2019-06-11 [1] CRAN (R 3.6.1)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 3.6.1)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.1)
pillar 1.4.2 2019-06-29 [1] CRAN (R 3.6.1)
pkgbuild 1.0.4 2019-08-05 [1] CRAN (R 3.6.1)
pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.6.1)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.1)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.1)
processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.1)
progress 1.2.2 2019-05-16 [1] CRAN (R 3.6.1)
promises 1.0.1 2018-04-13 [1] CRAN (R 3.6.1)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.1)
purrr 0.3.2 2019-03-15 [1] CRAN (R 3.6.1)
questionr 0.7.0 2018-11-26 [1] CRAN (R 3.6.1)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.6.1)
Rcpp 1.0.2 2019-07-25 [1] CRAN (R 3.6.1)
RCurl 1.95-4.12 2019-03-04 [1] CRAN (R 3.6.1)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.1)
rlang 0.4.0 2019-06-25 [1] CRAN (R 3.6.1)
rmarkdown 1.14 2019-07-12 [1] CRAN (R 3.6.1)
rmdformats * 0.3.5 2019-02-19 [1] CRAN (R 3.6.1)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.1)
RSQLite 2.1.2 2019-07-24 [1] CRAN (R 3.6.1)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.6.1)
Rsubread 1.34.6 2019-07-19 [1] Bioconductor
S4Vectors 0.22.0 2019-05-02 [1] Bioconductor
scales 1.0.0 2018-08-09 [1] CRAN (R 3.6.1)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.1)
shiny 1.3.2 2019-04-22 [1] CRAN (R 3.6.1)
stringi 1.4.3 2019-03-12 [1] CRAN (R 3.6.1)
stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.1)
testthat 2.2.1 2019-07-25 [1] CRAN (R 3.6.1)
tibble 2.1.3 2019-06-06 [1] CRAN (R 3.6.1)
tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.6.1)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
vctrs 0.2.0 2019-07-05 [1] CRAN (R 3.6.1)
viridis * 0.5.1 2018-03-29 [1] CRAN (R 3.6.1)
viridisLite * 0.3.0 2018-02-01 [1] CRAN (R 3.6.1)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.1)
xfun 0.8 2019-06-25 [1] CRAN (R 3.6.1)
XML 3.98-1.20 2019-06-06 [1] CRAN (R 3.6.1)
xtable 1.8-4 2019-04-21 [1] CRAN (R 3.6.1)
yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.1)
zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.6.1)
[1] /home/ubuntu/R/x86_64-pc-linux-gnu-library/3.6
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library